ABSTRACT

The spectral and temporal characteristics of American English
vowel and consonant sounds in a variety of phonetic contexts are
examined and compared with data reported in the literature.
Spectrograms and sampled spectra (obtained from an analog filter
bank connected to a digital computer) were assembled for a number
of monosyllabic and disyllabic utterances generated by three
talkers, and a variety of measurements were made from these
displays. The characteristics examined include durations of vowels,
durations of various phases of consonants in prestressed and
poststressed positions and in clusters, spectra of vowels and
diphthongs and their variation with time, spectra of consonants
during constricted intervals, and time-variation of spectra during
the release of consonants. The aim of the study is not to
present an exhaustive acoustic-phonetic description of American
English speech sounds but rather to indicate the kinds of acoustic
properties that need to be utilized in schemes for machine
recognition of speech.
1.  INTRODUCTION

The purpose of this report is to summarize some data on the
acoustic properties of speech sounds of American English. The
motivation is to present information that may be useful to the
researcher who is interested in machine recognition of speech.

A great deal of information with regard to the acoustic properties
of speech sounds has been published in journals and books devoted
to acoustics and phonetics. This published material is not,
however, directly usable by those engaged in automatic speech
recognition for several reasons. A principal reason is that the data
reported in the past have often not been presented in sufficiently
quantitative form. In many cases, data have not been given in
absolute terms, but rather relative values of properties have been
reported. Furthermore, numerical data are frequently obtained
from spectrograms or other displays in which a human observer must
interpret the display in order to make a measurement. In an
application such as machine recognition of speech, it is, of course,
essential that all analysis be done by machine. It is by no means
obvious that a machine can be programmed to perform the same kinds
of analysis as a human observer.

For these reasons, our study of the acoustics of speech has
included not only an examination of existing phonetics information
but also the acquisition and interpretation of some new data. In
our analysis of these data, we have attempted to specify acoustic
properties in a reasonably quantitative way that is amenable to
machine processing.

Our point of view in this study is that each phonetic unit is
characterized by a set of underlying attributes or features.
These features have certain well-defined articulatory and acoustic
correlates. When a phonetic segment is concatenated with other
phonetic segments, the articulatory and acoustic properties may
become distorted or modified as a consequence of the phonetic
environment. Thus a study of the acoustic characteristics of
phonetic segments must include not only an examination of the
underlying "undistorted" attributes of the segments but also a
consideration of the effects of the various phonetic environments in
which the segment can appear.

It is probable that the properties of a phonetic segment are
modified least by the environment when the segment occurs in an
isolated, stressed, consonant-vowel syllable. Acoustic invariants
for a consonant are more likely to be observed when it is the only
consonant preceding the stressed vowel in the syllable. It might
be hypothesized that a consonant or vowel in such a syllable would
have acoustic characteristics that are closest to the "ideal"
characteristics. In general, nonsense utterances of the form
/ə'CVC/ (C = consonant, V = vowel)* were used in this study to
obtain data corresponding to this ideal situation. The final
consonant may have some influence on the characteristics of the
preceding vowel, but it is possible to select consonants whose effect
on the vowel is minimal.

The modifications that occur in a segment as a consequence of its
phonetic environment are of several types. A stressed vowel
undergoes some change if the syllable in which it occurs is not in
isolation or, in general, is not in the final position of an utterance.


*The apostrophe indicates that the stress is on the second
syllable. The stress pattern in these nonsense utterances is like
that in the word about.
Such effects are examined in a preliminary way in this study by
obtaining data from bisyllabic words which have the stress on the
first syllable. Furthermore, stressed vowels are modified by the
final consonants in the syllables in which they occur, and
consequently it is necessary to examine the characteristics of vowels
with many following consonants. Consonants in prestressed
position can undergo appreciable modification when they occur in
consonant clusters, and utterances of the form /ə'C1C2(C3)V/ are used
to examine these effects. Likewise, consonant characteristics may
be altered when they are in poststressed position. Some utterances
to illustrate and, where possible, to quantify these effects are
included in the corpus of material in this study.


A goal in any study of the acoustic properties of speech sounds is
to find a description of a phonetic unit (or, for that matter, of a
group of a small number of units, such as a syllable) that is
sufficient to permit that unit to be uniquely identified without
reference to the context in which it appears. It must be recognized,
however, that the nature of human speech precludes the achievement
of this ideal goal. It is common for a given phonetic unit to be
so distorted in continuous speech that acoustic data in the speech
signal in the vicinity of that unit are insufficient to provide an
unequivocal identification of the unit. A listener is able to make
an interpretation of an utterance in which this unit appears
because he is familiar with the rules governing the sequence of
phonetic units (or, more precisely, the matrix of phonetic features)
that can occur in his language. In cases where a phonetic unit is
not sufficiently well defined acoustically, the listener must make
use of these rules to infer the presence and the identity of this
unit.
Thus a study such as the present one must not be considered as a
complete description of a set of phonetic units that will specify
algorithms for recognizing these units, but rather must be viewed
as leading towards procedures for preprocessing the speech signal,
to obtain as much information as possible from the acoustic
waveform. In a speech-recognizing device, the results of this
preprocessing would then be subjected to further analysis or
interpretation by a component in which the lexicon, the phonological
rules, and even certain syntactic and semantic rules of the
language are stored.




Report  No.  1669 


Bolt  Beranek  and  Newman  Inc 


2.  PROCEDURES  FOR  PROCESSING  THE  DATA 

The data presented here were obtained from recordings of a number
of monosyllabic and bisyllabic utterances of three talkers: two
males (KS and CW) and one female (GC). These utterances were
subjected to two kinds of preliminary processing. First, wide-band
spectrograms of all the words were made by using a Voiceprint
Laboratories sound spectrograph. The so-called logarithmic
frequency display was used, covering the frequency range to 7000 Hz.
Secondly, all utterances were passed through a specially designed
19-channel filter bank, and the rectified and smoothed outputs of
the filter bank were sampled and quantized (on a logarithmic scale)
and stored in the memory of a PDP-1 computer. The amplitude
quantization is such that each (logarithmic) amplitude step represents
3/4 dB. The numerical values of the sampled spectra so obtained
were printed out to permit detailed examination and analysis.
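
The logarithmic quantization described above can be illustrated with a short sketch. Only the 3/4-dB step size comes from the text; the function names, the choice of a unit reference amplitude, and rounding to the nearest step are assumptions made for illustration and are not taken from the PDP-1 program.

```python
import math

STEP_DB = 0.75  # size of one quantization step, as stated in the text (3/4 dB)

def level_to_db(level: int) -> float:
    """Relative level in dB represented by a quantized value (0 = reference)."""
    return level * STEP_DB

def db_to_amplitude_ratio(db: float) -> float:
    """Linear amplitude ratio corresponding to a level difference in dB."""
    return 10.0 ** (db / 20.0)

def quantize(amplitude: float, reference: float = 1.0) -> int:
    """Nearest quantization step for a positive amplitude.

    The reference amplitude and the rounding rule are hypothetical."""
    return round(20.0 * math.log10(amplitude / reference) / STEP_DB)

if __name__ == "__main__":
    print(level_to_db(8))                        # 8 steps correspond to 6.0 dB
    print(round(db_to_amplitude_ratio(6.0), 3))  # roughly a factor of 2 in amplitude
    print(quantize(2.0))                         # doubling the amplitude is about 8 steps
```

With this step size, a difference of one unit between two printed values corresponds to an amplitude ratio of about 1.09.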


The characteristics of the filter bank have been described
elsewhere (Stevens and von Bismarck, 1967). Table I lists the center
frequencies and bandwidths of the filters. Up to about 3000 Hz,
the filters have bandwidths of 360 Hz and are spaced 180 Hz apart.
At higher frequencies, the bandwidths are greater, and the
frequency responses of adjacent filters overlap at the 3-dB points.
Figure 1 shows the frequency-response curves for the filters, and
Fig. 2 shows the impulse response of the low-pass smoothing filter
in each channel. From Fig. 2 it can be seen that the low-pass
filters average the rectified outputs of the bandpass filters over a
time interval of 10-15 msec. This averaging time has, of course,
an important influence on the observed characteristics of rapidly
changing sounds, such as the onset of stop consonants.
TABLE I.  List of filter center frequencies and bandwidths.

Filter    Lower     Higher    Center       Band-
No.       Cutoff    Cutoff    Frequency    width
          (Hz)      (Hz)      (Hz)         (Hz)

 1         100       440       260          360
 2         260       620       440          360
 3         440       800       620          360
 4         620       980       800          360
 5         800      1160       980          360
 6         980      1340      1160          360
 7        1160      1520      1340          360
 8        1340      1700      1520          360
 9        1520      1880      1700          360
10        1700      2060      1880          360
11        1880      2240      2060          360
12        2060      2420      2240          360
13        2240      2600      2420          360
14        2420      2780      2600          360
15        2600      2960      2780          360
16        2960      3560      3260          600
17        3560      4400      3980          840
18        4400      5480      4940         1080
19        5480      6560      6020         1080
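
The layout of the filter bank in Table I can be checked with a small sketch. The cutoff and center frequencies below are transcribed from Table I; the helper names (`center_spacings`, `nearest_channel`, `channels_passing`) are hypothetical and only illustrate how the 180-Hz spacing and the overlap between neighboring 360-Hz bands might be examined.

```python
# (filter number, lower cutoff, upper cutoff, center frequency), all in Hz,
# transcribed from Table I.
FILTERS = [
    (1, 100, 440, 260), (2, 260, 620, 440), (3, 440, 800, 620),
    (4, 620, 980, 800), (5, 800, 1160, 980), (6, 980, 1340, 1160),
    (7, 1160, 1520, 1340), (8, 1340, 1700, 1520), (9, 1520, 1880, 1700),
    (10, 1700, 2060, 1880), (11, 1880, 2240, 2060), (12, 2060, 2420, 2240),
    (13, 2240, 2600, 2420), (14, 2420, 2780, 2600), (15, 2600, 2960, 2780),
    (16, 2960, 3560, 3260), (17, 3560, 4400, 3980), (18, 4400, 5480, 4940),
    (19, 5480, 6560, 6020),
]

def center_spacings(filters):
    """Differences between successive center frequencies."""
    centers = [f[3] for f in filters]
    return [b - a for a, b in zip(centers, centers[1:])]

def nearest_channel(freq_hz, filters=FILTERS):
    """Number of the filter whose center frequency is closest to freq_hz."""
    return min(filters, key=lambda f: abs(f[3] - freq_hz))[0]

def channels_passing(freq_hz, filters=FILTERS):
    """Filters whose nominal passband (between cutoffs) contains freq_hz."""
    return [n for n, lo, hi, _ in filters if lo <= freq_hz <= hi]

if __name__ == "__main__":
    print(center_spacings(FILTERS[:15]))  # constant spacing below about 3000 Hz
    print(nearest_channel(1000))          # channel centered nearest 1000 Hz
    print(channels_passing(500))          # two overlapping low-frequency channels
    print(channels_passing(4000))         # a single wide high-frequency channel
```

Because the low-frequency bands are twice as wide as their spacing, a component between two centers falls within two adjacent passbands, whereas above about 3000 Hz the nominal passbands only meet at their cutoff frequencies.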
Fig. 1  Measured frequency response of filter bank.  Ratio of

Fig. 2  Impulse response of low-pass filter used
to smooth rectified outputs of band-pass
filters.


An example of a spectrogram and a printout for one of the
utterances is shown in Fig. 3. Major characteristics of the utterance
are observable on both displays: the transition from the initial
/r/ to the vowel (samples 10-20), the stop gap for the /k/ (samples
37-42), the aspiration following the /k/ release (samples 43-45),
and the final /s/ (samples 55-80) can all be identified on the
spectrogram and on the printout. The spectrographic display
reveals patterns that are interpretable visually by the human
observer, whereas the computer printout provides a display that a human
can interpret only with difficulty, but which is suitable for
machine processing.
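
Since each printout row corresponds to one 10-msec sampling step, the sample ranges quoted above translate directly into approximate segment durations. The following sketch assumes that the quoted ranges are inclusive at both ends; the dictionary labels and the function name are illustrative only.

```python
SAMPLE_PERIOD_MS = 10  # one printout row per 10 msec

# Sample ranges quoted in the text for the word "raucous" (speaker KS).
SEGMENTS = {
    "r-to-vowel transition": (10, 20),
    "k stop gap": (37, 42),
    "k aspiration": (43, 45),
    "final s": (55, 80),
}

def duration_ms(first: int, last: int, period_ms: int = SAMPLE_PERIOD_MS) -> int:
    """Duration spanned by samples first..last, assuming both ends are included."""
    return (last - first + 1) * period_ms

if __name__ == "__main__":
    for name, (first, last) in SEGMENTS.items():
        print(f"{name:24s} {duration_ms(first, last):4d} msec")
```

Under this assumption the stop gap spans about 60 msec and the aspiration about 30 msec; the 10-15 msec averaging time of the smoothing filters limits how precisely such boundaries can be located.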
Fig. 3  Above: Spectrogram of the word raucous (speaker KS). The
samples indicated on the abscissa (sample number; 1 step = 10 msec)
represent instants of time at which the outputs of the filter
bank were sampled to obtain the printout at the left.

Left: Printout of 19-channel filter bank for the word raucous.
The sample numbers are designated in the left-hand column, and the
"total" at the right represents the sum of all filter outputs.
The numbers represent the amplitudes of the filter outputs on a
logarithmic scale in steps of approximately 3/4 dB.
3.  DESCRIPTION  OF  SPEECH  MATERIAL

The utterances used in this study are listed in Table II.* This
particular selection of utterances was made in order to sample a
wide variety of speech sounds occurring in various phonetic
contexts. Some of the material consists of nonsense syllables or
bisyllables, and other utterances are words of English. The series
of syllables of the form bVb was included to obtain basic data on
all the vowels and diphthongs in a consonantal environment that was
considered to have a minimal influence on the vowel. The nonsense
utterances of the form /ə'CV(C)/ were used to obtain further data
on vowels in different consonantal environments as well as to
examine various consonants in prestressed position and in final
position. Consonant clusters are represented in utterances of the
form /ə'C1C2V/ or /ə'C1C2C3V/. The bisyllabic English words were
used to provide examples of vowels and consonants in other phonetic
environments, particularly unstressed vowels and consonants in
poststressed position. Some real monosyllabic and bisyllabic words
were included to provide examples of consonant clusters in
poststressed position.

The  material  in  this  report  is  organized  into  four  parts,  each 
part  being  concerned  with  a  particular  class  of  speech  sounds: 


*The phonetic symbols used in Table II and throughout the text are
(with some minor modifications) the symbols of the International
Phonetic Association. Examples of words containing the nonobvious
phonetic symbols are the following: /i/ (beet), /ɪ/ (bit), /e/
(bait), /ɛ/ (bet), /æ/ (bat), /ɑ/ (cot), /ʌ/ (cut), /ɔ/ (bought),
/o/ (boat), /ʊ/ (foot), /u/ (boot), /ai/ (kite), /au/ (couch),
/ɔi/ (boil), /ɝ/ (bird), /ə/ (about), /θ/ (thin), /ð/ (then),
/š/ (shoe), /ž/ (beige), /ŋ/ (sing), /ʍ/ (which), /č/ (chin),
/ǰ/ (jump).
stressed vowels, consonants in prestressed position, unstressed
vowels, and consonants in poststressed (or unstressed) positions.


Stressed vowels are, in some sense, the sounds that are of most
importance in an utterance, and it is almost essential that any
speech-recognition scheme be able to locate and at least partially
identify these sounds. Next in order of importance are the
consonants and consonant clusters that precede stressed vowels.
These might be regarded as the prototype consonants; the acoustic
characteristics for consonants in this position should appear with
clarity and should not be subject to the distortions and omissions
that are typical of consonants in other phonetic environments.
The vowels and consonants in unstressed positions are often greatly
influenced by factors such as rate of talking and state of the
talker. These sounds probably provide cues for recognition that
are less reliable than those associated with vowels and consonants
in stressed positions.


The utterances listed in Table II and generated by three talkers
are not described exhaustively in this report. The purpose of the
report is rather to examine some highlights of the data. An
attempt is made to discuss all of the speech sounds, but not
necessarily to consider the detailed characteristics of these sounds in
all possible phonetic contexts that might be of interest. Most of
the data presented here are for one speaker (KS), but examples
from the other speakers are often shown either to corroborate the
results for speaker KS or to indicate the kind of variability that
might be expected from one speaker to another.
TABLE II.  List of utterances used in phonetic study. The numbers
           are simply codes for identifying the utterances,
           particularly for purposes of computer analysis. Nonsense
           utterances are described in terms of phonetic symbols.
           Words are identified by their orthography.

No.  Utterance    No.  Utterance    No.  Utterance    No.  Utterance

101  ə'pip        125  ə'tɑt        149  ə'kɑk        173  ə'sos
102  ə'pɪp        126  ə'tʌt        150  ə'kʌk        174  ə'sɔs
103  ə'pep        127  ə'tɝt        151  ə'kɝk        175  ə'sæs
104  ə'pɛp        128  ə'did        152  ə'gig        176  ə'sɑs
105  ə'pup        129  ə'dɪd        153  ə'gɑg        177  ə'sʌs
106  ə'pʊp        130  ə'ded        154  ə'gug        178  ə'sɝs
107  ə'pop        131  ə'dɛd        155  ə'fif        179  ə'ziz
108  ə'pɔp        132  ə'dud        156  ə'fɑf        180  ə'zɪz
109  ə'pæp        133  ə'dʊd        157  ə'fuf        181  ə'zez
110  ə'pɑp        134  ə'dod        158  ə'viv        182  ə'zɛz
111  ə'pʌp        135  ə'dɔd        159  ə'vɑv        183  ə'zuz
112  ə'pɝp        136  ə'dæd        160  ə'vuv        184  ə'zʊz
113  ə'bib        137  ə'dɑd        161  ə'θiθ        185  ə'zoz
114  ə'bɑb        138  ə'dʌd        162  ə'θɑθ        186  ə'zɔz
115  ə'bub        139  ə'dɝd        163  ə'θuθ        187  ə'zæz
116  ə'tit        140  ə'kik        164  ə'ðið        188  ə'zɑz
117  ə'tɪt        141  ə'kɪk        165  ə'ðɑð        189  ə'zʌz
118  ə'tet        142  ə'kek        166  ə'ðuð        190  ə'zɝz
119  ə'tɛt        143  ə'kɛk        167  ə'sis        191  ə'šiš
120  ə'tut        144  ə'kuk        168  ə'sɪs        192  ə'šɑš
121  ə'tʊt        145  ə'kʊk        169  ə'ses        193  ə'šuš
122  ə'tot        146  ə'kok        170  ə'sɛs        194  ə'žiž
123  ə'tɔt        147  ə'kɔk        171  ə'sus        195  ə'žɑž
124  ə'tæt        148  ə'kæk        172  ə'sʊs        196  ə'žuž
TABLE II (continued)

No.  Utterance    No.  Utterance    No.  Utterance    No.  Utterance

197  ə'hi         224  ə'ji         251  ə'brɑ        278  sober
198  ə'hɑ         225  ə'jɑ         252  ə'blɑ        279  robin
199  ə'hu         226  ə'ju         253  ə'drɑ        280  rabid
200  ə'mim        227  ə'ʍi         254  ə'glɑ        281  baby
201  ə'mɑm        228  ə'ʍɑ         255  ə'skrɑ       282  bottle
202  ə'mum        229  ə'ʍu         256  ə'splɑ       283  water
203  ə'nin        230  ə'čič        257  bib          284  button
204  ə'nɪn        231  ə'čɑč        258  bɪb          285  seated
205  ə'nen        232  ə'čuč        259  beb          286  detail
206  ə'nɛn        233  ə'ǰiǰ        260  bɛb          287  modal
207  ə'nun        234  ə'ǰɑǰ        261  bæb          288  raider
208  ə'nʊn        235  ə'ǰuǰ        262  bɑb          289  hidden
209  ə'non        236  ə'spɑ        263  bɔb          290  beaded
210  ə'nɔn        237  ə'stɑ        264  bob          291  body
211  ə'næn        238  ə'skɑ        265  bʊb          292  local
212  ə'nɑn        239  ə'smɑ        266  bub          293  poker
213  ə'nʌn        240  ə'snɑ        267  bʌb          294  reckon
214  ə'nɝn        241  ə'kwɑ        268  bɝb          295  raucous
215  ə'lil        242  ə'twɑ        269  baib         296  cocoa
216  ə'lɑl        243  ə'prɑ        270  baub         297  legal
217  ə'lul        244  ə'trɑ        271  bɔib         298  sugar
218  ə'rir        245  ə'krɑ        272  apple        299  wagon
219  ə'rɑr        246  ə'plɑ        273  paper        300  ragged
220  ə'rur        247  ə'klɑ        274  open         301  pogo
221  ə'wiŋ        248  ə'slɑ        275  rapid        302  suffer
222  ə'wɑŋ        249  ə'swɑ        276  happy        303  muffin
223  ə'wuŋ        250  ə'flɑ        277  table        304  hovel
TABLE II (continued)

No.  Utterance    No.  Utterance    No.  Utterance    No.  Utterance

305  cover        318  mohair       331  flyer        344  bald
306  author       319  lemon        332  nowhere      345  heart
307  pathos       320  famous       333  kitchen      346  hard
308  father       321  dinner       334  ketchup      347  saunter
309  fathom       322  peanut       335  region       348  launder
310  lesser       323  single       336  edges        349  seltzer
311  essay        324  singer       337  gaunt        350  sudser
312  dozen        325  killer       338  bond         351  Gloucester
313  busy         326  pallid       339  lots         352  filter
314  bushel       327  horrid       340  sods         353  builder
315  nation       328  very         341  frost        354  martyr
316  measure      329  tower        342  fizzed       355  harder
317  vision       330  seaweed      343  fault
4.  STRESSED  VOWELS

There are various schemes for classifying the vowels of American
English. In our discussion here, we postulate that there are 15
vowels that can occur in stressed position. Three of these (/ai/,
/au/, and /ɔi/) are diphthongs, one (/ɝ/) is a retroflex vowel, and
the remaining 11 can be categorized in terms of articulatory
features in the manner shown in Table III.


TABLE III.  Features of the vowels of American English. The
            symbol + indicates the presence of a feature, and
            the symbol - indicates the absence of a feature.

         i  ɪ  e  ɛ  æ  ɑ  ʌ  ɔ  o  ʊ  u

back     -  -  -  -  -  +  +  +  +  +  +

high  +  +  + 

low  +  +  +  +  - 


rounded  + 

tense    +  -  +  -  -  +  -  +  +  -  +


The diphthongs /ai/, /au/, and /ɔi/ can be classed as tense vowels.
On the basis of acoustical data, we shall observe that certain
vowels followed by /l/ and /r/ also have some of the characteristics
of diphthongs.

In English there is a tendency for some of the tense vowels to be
diphthongized; that is, the vowel quality changes with time through
the vowel. This diphthongization is particularly evident in the
vowels /e/ and /o/, and is less apparent (but still observable)
for the vowels /i/ and /u/. Thus, of the nine tense vowels, four
terminate in an /i/-like position (/ai, ɔi, e, i/), three
terminate in an /u/-like position (/au, o, u/), and the two low back
vowels (/ɑ/ and /ɔ/) are not diphthongized.
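
The grouping of the nine tense vowels by their terminal position can be written down as a small lookup table. This is only a restatement of the classification given above; the dictionary name, the use of None for vowels that are not diphthongized, and the helper function are assumptions made for illustration.

```python
# Terminal (offglide) target for each tense vowel, as classified in the text.
# None marks the two low back vowels, which are not diphthongized.
OFFGLIDE = {
    "ai": "i", "ɔi": "i", "e": "i", "i": "i",
    "au": "u", "o": "u", "u": "u",
    "ɑ": None, "ɔ": None,
}

def group_by_offglide(table):
    """Collect the vowels that share the same terminal position."""
    groups = {}
    for vowel, target in table.items():
        groups.setdefault(target, []).append(vowel)
    return groups

if __name__ == "__main__":
    for target, vowels in group_by_offglide(OFFGLIDE).items():
        label = f"/{target}/-like" if target else "not diphthongized"
        print(f"{label:18s} {' '.join(vowels)}")
```

A recognition scheme might use such a table to decide in which direction the spectrum of a tense vowel is expected to move toward the end of the vowel.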

The lax vowels /ɪ, ɛ, æ, ʌ, ʊ/ are generally shorter than the
tense vowels (although /æ/ may be an exception). Throughout
their duration these vowels tend to drift toward an open or schwa
vowel configuration (designated by /ə/), with the possible
exception of /ʌ/, which is already close to that configuration. It is
noteworthy that a (stressed) lax vowel is always followed by a
consonant in English, whereas a tense vowel may appear in final
position without a following consonant.

Some of the effects just noted can be observed in Fig. 4, which
shows spectrograms of five of the vowels generated by one speaker
in the environment /bVb/. For example, the diphthongization of /u/
is manifested in the falling second formant, while in /e/ the
second and third formants are rising throughout the vowel.

Durations of the vowels in the environment /bVb/ are given in
Table IV. These durations, which were measured from spectrograms,
represent the time from release of the initial /b/ to the onset of
the stop gap in the final /b/. It is evident that the tense vowels
are the longest, with the exception of the lax vowel /æ/ and the
retroflex vowel /ɝ/, which have durations comparable to those of
tense vowels. More complete data on vowel durations for other
consonantal environments in a nonsense-syllable frame have been
reported by House (1961). As is well known, vowel durations
depend to some extent on the following consonant, being longer for
final voiced consonants and shorter for final voiceless consonants.
The durations given in Table IV cannot be expected to remain
invariant in utterances that are several syllables long. In natural
speech, the vowel durations will usually be shorter, particularly
when the stressed vowel is followed by a syllable containing an
unstressed vowel, but the durations for different vowels will tend
to maintain the same relative values as those shown in Table IV.


TABLE IV.  Durations of stressed vowels in the environment
           /bVb/. Averages for three talkers generating
           one utterance for each vowel.

Vowel   Duration     Vowel   Duration     Vowel   Duration
        (msec)               (msec)               (msec)

i       300          au      330          ɪ       170
e       300          ai      320          ɛ       200
ɑ       330          ɔi      280          æ       330
ɔ       310          ɝ       270          ʌ       180
o       290                               ʊ       170
u       260

The durations of the stressed vowels in some of the bisyllabic
words (with stress on the initial syllable) were also measured.
As noted earlier, the words included some consonant clusters in
syllable-final position. For the clusters containing /l/ and /r/
(as in the words filter and harder), it is not possible to
establish a boundary between the vowel and the consonant for
purposes of duration measurements. For such utterances, therefore,
durations of the entire vowel-consonant combination (e.g., /ɪl/
and /ɑr/) were measured. The results of these measurements are
shown in Table V. For purposes of these measurements, the onset
of the vowel is considered to be at the instant of consonant
release. Individual data for each of the three speakers are given
in order to provide an indication of the variability to be
expected from speaker to speaker. The words are arranged in order
of increasing mean vowel duration.

Comparison of the vowel durations of single stressed vowels in
bisyllabic words (in Table V) with the corresponding vowels in the
monosyllabic utterances /bVb/ from Table IV indicates that the
duration of a stressed vowel is often considerably shortened in a
multisyllabic utterance. For some vowels, the duration in the
bisyllabic context is as little as one-third of the duration in a
monosyllable. When the vowel /ɑ/ is followed by /r/, or when /ɪ/
or /ɛ/ are followed by /l/, the total duration of the
vowel-sonorant combination is comparable to that of a tense vowel when
that vowel is followed by an obstruent (i.e., a consonant produced
with complete closure or with noise at the constriction). Thus
there is a tendency for these vowel-sonorant clusters to behave
like tense vowels or diphthongs as far as their durations are
concerned.
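
The amount of shortening can be made concrete by dividing a mean duration from Table V by the mean /bVb/ duration of the same vowel from Table IV. In the sketch below the durations are taken from those two tables, but the pairing of each word with a stressed vowel is an assumption based on ordinary pronunciation and is not stated in the report; the names are illustrative.

```python
# Mean /bVb/ durations (msec) from Table IV for the vowels used below.
MONOSYLLABLE_MS = {"i": 300, "e": 300, "ɑ": 330, "ɪ": 170, "ʌ": 180, "ʊ": 170}

# Mean stressed-vowel durations (msec) from Table V, paired with the stressed
# vowel ASSUMED for each word (the pairing is not given in the report).
BISYLLABLE = {
    "sugar": ("ʊ", 82),
    "hidden": ("ɪ", 85),
    "seated": ("i", 110),
    "suffer": ("ʌ", 110),
    "paper": ("e", 165),
    "father": ("ɑ", 243),
}

def shortening_ratios():
    """Ratio of bisyllabic-word duration to /bVb/ duration for each word."""
    return {w: d / MONOSYLLABLE_MS[v] for w, (v, d) in BISYLLABLE.items()}

if __name__ == "__main__":
    ratios = shortening_ratios()
    for word, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
        print(f"{word:8s} {ratio:.2f}")
```

Under these assumed pairings the smallest ratio is obtained for seated (110 msec against 300 msec for /i/ in /bVb/), which is close to the one-third figure mentioned above, while the lax vowels in sugar and hidden are reduced to roughly one-half.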

In general, then, the durations of stressed vowels, diphthongs, or
vowel-sonorant combinations may be as short as 80 msec and as long
as 380 msec. The shorter vowels are the single lax vowels in
bisyllabic words with stress on the first syllable, while the longer
ones are tense vowels or diphthongs in monosyllabic isolated words.
TABLE V.  Measured vowel durations (in milliseconds) from
          vowels and vowel-consonant combinations occurring
          in disyllabic words spoken in isolation. The
          underscored items indicate the segments whose
          durations were measured. The words are arranged
          in order of increasing duration value.

                     Speaker
Word            KS      CW      GC      Mean

sugar            82      82      83      82
hidden           83      75      98      85
seated           98     105     128     110
suffer          112     112     105     110
button          120     120      90     110
busy            112     135     143     130
ketchup         135     142     113     130
Gloucester      142     150     158     150
seltzer         150     165     143     153
filter          120     173     188     160
saunter         158     158     165     160
rapid           150     158     188     165
paper           158     172     165     165
bottle          173     188     158     173
martyr          165     173     195     178
harder          165     173     203     180
modal           165     195     188     183
raucous         202     188     165     185
water           188     195     180     188
launder         202     158     225     195
robin           195     202     240     212
builder         217     240     240     232
father          240     225     263     243
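
The entries of Table V can be checked for internal consistency with a short sketch: the tabulated mean of each row should agree with the average of the three speakers, and the means should not decrease down the table. The values are transcribed from Table V; the function names and the 1-msec tolerance are assumptions made for illustration.

```python
# (word, KS, CW, GC, tabulated mean), durations in msec, from Table V.
TABLE_V = [
    ("sugar", 82, 82, 83, 82), ("hidden", 83, 75, 98, 85),
    ("seated", 98, 105, 128, 110), ("suffer", 112, 112, 105, 110),
    ("button", 120, 120, 90, 110), ("busy", 112, 135, 143, 130),
    ("ketchup", 135, 142, 113, 130), ("Gloucester", 142, 150, 158, 150),
    ("seltzer", 150, 165, 143, 153), ("filter", 120, 173, 188, 160),
    ("saunter", 158, 158, 165, 160), ("rapid", 150, 158, 188, 165),
    ("paper", 158, 172, 165, 165), ("bottle", 173, 188, 158, 173),
    ("martyr", 165, 173, 195, 178), ("harder", 165, 173, 203, 180),
    ("modal", 165, 195, 188, 183), ("raucous", 202, 188, 165, 185),
    ("water", 188, 195, 180, 188), ("launder", 202, 158, 225, 195),
    ("robin", 195, 202, 240, 212), ("builder", 217, 240, 240, 232),
    ("father", 240, 225, 263, 243),
]

def means_consistent(rows, tolerance_ms=1.0):
    """True if every tabulated mean is within tolerance of the speaker average."""
    return all(abs((ks + cw + gc) / 3 - mean) < tolerance_ms
               for _, ks, cw, gc, mean in rows)

def means_nondecreasing(rows):
    """True if the tabulated means never decrease from one row to the next."""
    means = [row[4] for row in rows]
    return all(a <= b for a, b in zip(means, means[1:]))

def largest_speaker_spread(rows):
    """Word with the largest difference between speakers, and that difference."""
    word, ks, cw, gc, _ = max(rows, key=lambda r: max(r[1:4]) - min(r[1:4]))
    return word, max(ks, cw, gc) - min(ks, cw, gc)

if __name__ == "__main__":
    print(means_consistent(TABLE_V))
    print(means_nondecreasing(TABLE_V))
    print(largest_speaker_spread(TABLE_V))
```

The spread between speakers gives a rough indication of the talker-to-talker variability mentioned in the text; it is largest for the vowel-sonorant combinations in filter and launder.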

A comprehensive model that accounts for the variation in duration
for stressed vowels in various environments (including longer words
and phrases, which are not examined in this study) has yet to be
developed. The duration of a vowel in a multisyllabic utterance is
evidently influenced by the "rhythm" of the utterance, including
the timing between syllabic nuclei, and the factors that determine
this timing of these gross aspects of an utterance are not known at
present. Informal observations indicate, however, that in a
multisyllabic utterance the time intervals between vowels with stress
are much less variable than the durations of individual stressed
vowels.

Spectra of the 15 vowels uttered in the context /bVb/ by one of the
speakers are shown in Fig. 5. These spectra are actually smoothed
outputs of the 19-channel filter bank referred to earlier. The
solid curves for each of these vowels represent spectra taken at about
70-100 msec after the release of the initial /b/. In cases where
the vowels appear to be diphthongized, or to drift towards a schwa
configuration, one or more additional spectra are shown. These are
samples at later instants of time in the vowels. For each vowel,
the spectrum samples are identified by number. These numbers simply
designate which 10-msec interval was examined, and the numbers
begin at an arbitrary point just prior to the onset of the utterance,
as shown in the sample of the printout displayed in Fig. 3.

For four of the vowel spectra (/i, æ, ɑ, u/), arrows are drawn to
indicate the frequencies of the lowest two or three formants as
measured from the spectrograms.* It is evident that the spectra

*The  formant  frequencies  for  these  and  other  vowels  are  in  the 
ranges  reported  by  Peterson  and  Barney  (1952),  who  did  an  ex¬ 
haustive  study  of  the  formant  frequencies  of  a  number  of  vowels 
in  the  context  hVd,  spoken  by  many  different  people. 
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FILTER  NUMBER 


Spectra  of  15  stressed  vowels  in  the  environ¬ 
ment  / b - b /  obtained  from  19-channe1  filter 
bank  Curves  are  labeled  with  sample  number 
(10-msec  sampling  interval)  beginning  at  an 
arbitrary  time  prior  to  onset  of  utterance. 
Solid  lines  represent  spectra  sampled  70-100 
msec  after  onset  of  initial  consonant.  Dash¬ 
ed  lines  represent  spectra  sampled  later  in 
the  vowel  for  cases  where  there  is  an  appre¬ 
ciable  shift  in  spectrum.  Spectra  are  sampled 
at  three  points  throughout  the  diphthongs. 

For  the  vowels  /  i  /  /ae  /  /a/  /u/,  the  small  ar¬ 
rows  indicate  the  frequencies  of  fn- mants  as 
measured  from  spectrograms.  Data  c>\2  for 
speaker  KS . 
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obtained  from  the  relatively  broad  filters  in  the  19-channel  fil¬ 
ter  bank  have  peaks  in  the  vicinity  of  formant  frequencies.  How¬ 
ever,  when  two  formants  are  sufficiently  close  togetner,  the  spec¬ 
trum  representation  from  the  filter  bank  may  show  only  a  single 
broad  peak.  We  regard  this  poor  resolution  of  spectral  peaks  not 
to  be  a  drawback  in  the  19-channel  spectrum  analyzer;  there  is  some 
evidence  (Fant,  1959;  Fujimura,  1 9 6 7 )  that  two  closely  spaced  for¬ 
mants  tend  to  be  interpreted  perceptually  in  the  same  way  as  a 
single  energy  concentration,  whether  the  two  formants  are  FI  and  F2 
(Formant  1  and  Formant  2),  as  for  back  vowels,  or  F2  and  F3,  as  for 
front  vowels.  Furthermore,  theoretical  considerations  show  that 
the  relative  amplitudes  of  different  regions  of  vowel  spectra  are 
dependent  on  the  formant  frequencies,*  and  hence  provide  informa¬ 
tion  regarding  the  locations  of  formants  (Fant,  1956;  Stevens  and 
House,  1961). 

Several gross properties of the data are apparent from Fig. 5. All
the vowels are characterized by a major energy concentration at low
frequencies, the peak being in one of the three filters 2, 3, and
4 (spanning the frequency range 260 to 980 Hz). This region
corresponds, of course, to the first-formant frequency, although in the
case of the back vowels /a, ɔ, ʌ, o, ʊ, u/ the low-frequency peak is
a consequence of both the first and the second formants, which are
close together for these vowels.

*These amplitude relations indicate that an increase in the
frequency of a given formant causes an increase in the amplitude
of the spectrum peaks corresponding to formants located at
frequencies above that formant. Also, when two formants move close
together in frequency, the amplitude of the spectrum in the
vicinity of the frequencies of these formants increases.

The front vowels /i, ɪ, e, ɛ, æ/ always have one or more additional
energy peaks at high frequencies (filter number 8, at 1520 Hz, or
higher) and a significant energy minimum between the low- and
high-frequency peaks. This energy minimum for this speaker is always in
the range of filters 6 and 7 (1160 to 1340 Hz), and the minimum
value is always at least 10 units (7.5 dB) below the higher-frequency
peak. The amplitude of the high-frequency filter with maximum
energy is no more than 20 units (15 dB) below the peak amplitude
at low frequencies for this speaker.
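
The unit and decibel pairs quoted above (10 units for 7.5 dB, 20 units for 15 dB) imply roughly 0.75 dB per amplitude unit. The following is a minimal sketch of that conversion together with the two amplitude conditions just stated for this speaker's front vowels. The constant, the function names, and the assumption of a linear unit scale are illustrative and are not taken from the original analysis programs.

```python
# Illustrative helper; DB_PER_UNIT is inferred from "10 units (7.5 dB)".
DB_PER_UNIT = 0.75


def units_to_db(units):
    """Convert filter-bank amplitude units to decibels (assumed linear scale)."""
    return units * DB_PER_UNIT


def has_front_vowel_valley(low_peak, valley, high_peak):
    """Check the two amplitude conditions reported for front vowels.

    All arguments are amplitudes in filter-bank units:
    - the valley is at least 10 units below the high-frequency peak, and
    - the high-frequency peak is no more than 20 units below the low peak.
    """
    deep_valley = (high_peak - valley) >= 10
    strong_high_peak = (low_peak - high_peak) <= 20
    return deep_valley and strong_high_peak
```

With these assumptions, the 8-unit threshold used later in the report corresponds to 6 dB.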

The back vowels have no such deep valley in the spectrum. If an
energy minimum exists in this frequency range (filters 6 and 7), it
is a rather shallow minimum, and the amplitude of the
higher-frequency peak is well over 20 units below that of the
low-frequency peak.

The various front vowels are distinguished from one another by the
position of the low-frequency energy peak and by the distance (in
frequency) between the central energy minimum and the adjacent
high- and low-frequency peaks. For /i/, the low peak is at
filter 2 (440 Hz) and the higher-frequency concentration is at
filter 10 (1880 Hz) and above. At the other extreme, the low front
vowel /æ/ has the low-frequency peak at filter 4 (800 Hz) and the
higher peak at filter 8 (1520 Hz).


Distinctions among the various back vowels are made primarily on
the basis of the frequency width and position of the major
low-frequency energy concentration, which is a consequence of the first
and second formants. This energy concentration is lowest in
frequency for /u/ and highest in frequency for /a/ and /a/, with
/U oo/ lying between. The high frequencies (above filter 9 at
1700 Hz) are sufficiently weak that they do not play an important
role in distinguishing between the back vowels.

The diphthongized vowel /e/ moves from a spectrum shape intermediate
between /i/ and /æ/ toward an /i/-like configuration. The
vowel /o/, on the other hand, has an initial spectrum shape
intermediate between /u/ and /a/ and then glides toward a /u/-like
configuration. These diphthongs in this phonetic context of an
isolated CVC utterance are characterized by a decrease in amplitude
of the low-frequency peak as the glide proceeds toward the extreme
high position characteristic of the /i/ or /u/. Likewise, the
vowels /i/ and /u/ show some diphthongization toward more-extreme
configurations; for both vowels there is a slight drop in
frequency of the low-frequency peak as well as a decrease in its
amplitude.

The diphthongs /ai, au, oi/ show the expected motions between two
vowel configurations. For /ai/, the spectrum near the beginning
is like that of the vowel /a/; it moves, at first slowly and then
more rapidly, toward an /i/ or /ɪ/ spectrum. The most obvious
effect for this diphthong is the introduction of the midfrequency
minimum as the vowel glides from a back configuration to a front
configuration. The combination /oi/ has similar characteristics.
The diphthong /au/ is, of course, a back vowel throughout its
length, and the movement is primarily a shift of the low-frequency
peak in the downward direction, with a resulting decrease in
amplitude in the high-frequency range.

The acoustic data for the tense vowels and diphthongs provide
evidence, therefore, that one set of vowels (/i, e, ai, oi/) is
diphthongized with a final glide toward an /i/-like spectrum, whereas
another set (/u, o, au/) has a final glide toward a /u/-like
spectrum, as discussed earlier. The low vowels /a/ and /ɔ/ are the
only tense vowels in American English that do not have one of
these two glides.

The lax vowels /!, e, a. »/ appear also to exhibit a change in
spectrum as a function of time. For each of these vowels, the
spectrum towards the end of the vowel tends to have a
second-formant peak in the vicinity of filter 8 (1520 Hz). (For /,
and to some extent for M, this tendency is more apparent in
utterances with the final consonant /d/; the final consonant
always has some influence on the vowel spectrum near the end of the
vowel.) Such a second-formant frequency is characteristic of the
schwa vowel /ə/.

Thus any stressed vowel in which there is a drift toward the
schwa position must be a lax vowel. No tense vowel has this
property. It may be significant also that there is a reduction in
amplitude of the low-frequency peak (by about 10 units) during the
drift toward the schwa position, but there is no appreciable drop
in amplitude of the high-frequency peak. In the /ʌ/, for example,
the low peak decreases in amplitude by 11 units and shifts
slightly in frequency, whereas the higher peak changes very
little in amplitude.
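
The drift toward schwa described above can be sketched as a comparison between an early and a late spectrum sample of the same vowel. This is only an illustration: the function name, the search range of filters 6 through 12 for the second-formant peak, the tolerance of one filter around filter 8, and the 8-unit threshold on the drop of the low-frequency peak are all assumptions chosen to reflect the "about 10 units" figure, not values given in the report.

```python
def drifts_toward_schwa(early, late, f2_filter=8, min_low_drop=8):
    """Return True if `late` looks like a schwa-ward drift from `early`.

    early, late: lists of 19 filter amplitudes (filter 1 first), in units.
    Conditions (both hypothetical thresholds):
    - the low-frequency peak (filters 1-4) drops by at least min_low_drop
    - the strongest of filters 6-12 in the late sample lies within one
      filter of f2_filter (filter 8, about 1520 Hz)
    """
    if len(early) != 19 or len(late) != 19:
        raise ValueError("expected 19 filter outputs per spectrum")
    low_drop = max(early[:4]) - max(late[:4])
    mid = late[5:12]  # filters 6 through 12
    f2_peak_filter = 6 + mid.index(max(mid))
    return low_drop >= min_low_drop and abs(f2_peak_filter - f2_filter) <= 1
```

A tense vowel whose spectrum is stable would fail the first condition, since its low-frequency peak would not drop appreciably between the two samples.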


By the criteria discussed above, /ɝ/ would be classified as a
back vowel since it does not have a pronounced midfrequency
minimum. The secondary peak at filter 6, which is of appreciable
amplitude relative to the low-frequency maximum for this vowel,
serves to distinguish /ɝ/ from the other back vowels. The
relatively high amplitude of this peak is presumably due to the
proximity of the second and third formants in the frequency range
1000-1600 Hz.


The vowel spectra for the other two speakers exhibit
characteristics similar to those shown in Fig. 5. Spectra of three of
the vowels for the three speakers are displayed in Fig. 6. In
all cases, the vowels were in the environment /bVb/; the spectra
were sampled 70 to 100 msec following the release and again at a
later time in cases where the spectrum changed appreciably. In
the case of the vowel /i/, all spectra have a low- and a
high-frequency energy concentration, with a broad valley between these
peaks. There is a slight diphthongization toward a lower
first-formant frequency and a higher second-formant frequency for all
speakers, but there are appreciable differences in the shape of
the high-frequency spectrum (above about 2000 Hz). For the vowel
/ɪ/, the shift toward a schwa vowel (higher first formant, lower
second formant) is evident for all speakers, but is more
pronounced for some speakers than for others. The vowel /a/ has the
broad low-frequency peak for each of the three speakers.


The vowel spectra shown in Figs. 5 and 6 are influenced to some
extent by the phonetic environment in which the stressed vowels
occur. Effects of consonantal environment on vowel formant
frequencies have been shown previously (Lehiste and Peterson, 1961;
Stevens and House, 1963), and indicate that adjacent consonants
tend to influence lax vowels more than tense vowels. An illustration
of the effect of phonetic environment for a lax vowel is
given in Fig. 7, which illustrates the range of spectra observed
in the middle of the stressed vowel /ɪ/ in seven different
nonsense syllables and bisyllabic words. The range tends to be
greater at high frequencies than at low frequencies, presumably
since small shifts in a formant frequency have a greater influence
on spectrum amplitudes above that frequency than below it. Two
examples of more deviant spectra of /ɪ/ are also shown in Fig. 7.
The following consonants /ŋ/ and /l/ appear to have a strong
effect on this vowel.

Fig. 6 Spectra of three stressed vowels (in the
environment /b-b/) are compared for three
speakers. Data obtained from 19-channel
filter bank. See legend of Fig. 5.

Fig. 7 The upper graph shows the
range of spectra for the
stressed vowel /ɪ/ for
seven different consonantal
environments in nonsense
syllables and in bisyllabic
words. The lower
two graphs are examples of
spectra of the same vowel
when it is modified appreciably
by the final consonant.
Speaker KS.


Report  No.  1669 


Bolt  Beranek  and  Newman  Inc 


The data of the type shown in Figs. 5-7 and in Tables IV and V
suggest the possibility of developing algorithms that would be
useful in separating one class of vowels from another. Although
the purpose of this report is not to present such algorithms, it
may be of interest to suggest some possibilities in order to
indicate the kind of results that might be expected. Consider, for
example, the separation of vowels into the classes front and back.
(The diphthongs are omitted for the purposes of this analysis.)
For the speakers examined in this study, front vowels are always
characterized by a spectral minimum in the range of filters 6 and
7 (980 to 1520 Hz), and the low- and high-frequency spectral maxima
are roughly at equal distances (in hertz) on either side of this
minimum. For back vowels, the high-frequency spectral maximum,
if it exists, is of much lower amplitude than the corresponding
maximum for front vowels.

An  algorithm  that  would  roughly  take  these  facts  into  account  is 
the  following: 

(1) Look at the outputs of filters 5 through 8. If a minimum
does not occur in filters 6 and 7 in this region, the vowel
is a back vowel. (This procedure identifies all but a few
of the back vowels examined in the /bVb/ utterances of this
study.) If such a minimum does occur, record its value A_m
in filter a.

(2) Find the spectral maximum below filter a. Say it is in
filter b. Find filter c an equal distance above a; i.e.,
c - a = a - b. Determine the maximum A_n in one of the three
filters c ± 1. Compute A_n - A_m. If this difference is less
than 8 units (about 6 dB), then the vowel is a back vowel.
Otherwise it is a front vowel.

Examination of a number of the vowels (all the vowels in the /bVb/
context and a number of others) reveals that this algorithm divides
the stressed vowels into two classes as desired, with no overlap.
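
The two steps above can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' program: the function name, the representation of a spectrum as a list of 19 amplitudes (filter 1 first), and the tie-breaking rule (the lowest-numbered filter wins when amplitudes are equal) are assumptions. "A minimum occurs in filters 6 and 7" is interpreted here as "the lowest of filters 5 through 8 is filter 6 or filter 7".

```python
def classify_front_back(spectrum, threshold_units=8):
    """Classify a vowel spectrum as "front" or "back".

    spectrum: list of 19 filter amplitudes in units, filter 1 first.
    """
    if len(spectrum) != 19:
        raise ValueError("expected 19 filter outputs")

    def amp(k):
        # Amplitude of filter number k (1-based).
        return spectrum[k - 1]

    # Step 1: locate the minimum among filters 5 through 8.
    a = min(range(5, 9), key=amp)
    if a not in (6, 7):
        return "back"
    a_m = amp(a)

    # Step 2: spectral maximum below filter a, then mirror it above a.
    b = max(range(1, a), key=amp)
    c = a + (a - b)
    candidates = [k for k in (c - 1, c, c + 1) if 1 <= k <= 19]
    a_n = max(amp(k) for k in candidates)

    # A shallow valley (less than 8 units, about 6 dB) means a back vowel.
    return "back" if (a_n - a_m) < threshold_units else "front"
```

For an /i/-like spectrum with a low peak at filter 2, a valley at filter 6, and a high peak at filter 10, the mirrored filter is c = 10, so the high-frequency peak is found directly.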

Similar algorithms could be developed for other vowel features.
The features high and low, for example, would be identified, at
least in part, by the position of the low-frequency peak.

In summary, then, the attributes that distinguish one stressed
vowel from another must include (1) duration, (2) spectrum
characteristics, and (3) how the spectrum changes with time. The kind
of display provided by the 19-channel filter bank seems to contain
enough information to permit the stressed vowels to be
distinguished from one another using these types of criteria.

5.  CONSONANTS  IN  PRESTRESSED  POSITION:  SINGLE  CONSONANTS

There are about 23 consonants that can appear in prestressed
position in American English. One procedure for classifying these
consonants in terms of binary features is given in Table VI. This
is a slightly modified version of the system proposed by Chomsky
and Halle (1968). Place of articulation for the consonants in
this system is specified by the features anterior and coronal.
The feature anterior indicates that the consonant is generated in
front of the palato-alveolar region of the mouth, and coronal
designates a consonant generated with the blade of the tongue. The
term aspiration applies to a consonant for which noise energy is
generated at the glottal opening following the consonantal
release. A sonorant consonant is generated with no major obstruction
to the air flow above the vocal cords. In discussing the
acoustic characteristics of various consonants, the sounds will
be grouped roughly into classes suggested by the feature
description of Table VI. Other features can be used to characterize
certain of these consonants, but those features will not be referred
to in this report.


The acoustic information necessary for the identification of these
consonants is of two kinds: (1) the characteristics during the
constricted consonant interval preceding the release into the
vowel; and (2) acoustic events at the release of the constricted
interval and during the 50- to 100-msec interval in which there
is a transition into the vowel. For consonants in intervocalic
or in final position, the transition from the preceding vowel
into the consonant also provides information about the consonant.

We consider first some data obtained within the constricted
consonant interval.

5.1  Closure  interval:  stop  and  nasal  consonants

The stop and nasal consonants can be roughly placed in the same
class, since they all have a reasonably steady-state interval
followed by a discontinuous change at the instant of release into
the following vowel. Spectrograms giving examples of each class
of consonants for one speaker are shown in Fig. 8.

During the voiceless stop consonants, the constricted interval is
silent, whereas in voiced stops there may be some vocal-cord
vibration within this interval. The filter bank gives spectra of
the form shown in Fig. 9 during this voicing interval. There is
energy essentially only in the lowest two frequency bands, and
the amplitude in the lowest filter is 20 to 30 units below the peak
amplitude (maximum amplitude in any one of filters 2, 3, 4)
during the following stressed vowel. The spectrum is more or less
the same, independent of which stop consonant is involved, but
the overall amplitude depends to some extent upon the speaker, the
consonant, and the following vowel.


For nasal consonants, the spectrum within the closure interval is
of higher intensity and is characterized by relatively greater
energy at higher frequencies, as shown in Fig. 10. The spectral
maximum is in filter 1 or in filter 2, and is usually about 10 to 15
units below the peak amplitude in the following vowel. There are
no significant and consistent differences in the spectra for /m/
and for /n/, and the following vowel does not have an appreciable
effect on the spectrum, at least when it is displayed in this
relatively gross manner. All nasal consonants appear to have
relatively weak spectral energy in the vicinity of filter 4 (800 Hz)
relative to the energy at lower frequencies (Fujimura, 1962).

Fig. 8 Spectrograms illustrating properties of stop and nasal
consonants. Speaker KS.

Fig. 10 Spectra during closure for nasal consonants
in prestressed position.

This attribute distinguishes nasal consonants from /l/, as noted
later. The detailed shape of the spectra for the nasals at high
frequencies may differ considerably from one speaker to another,
as comparison of /m/ for CW and GC in Fig. 10 demonstrates.
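
The amplitude figures quoted for the closure interval (20 to 30 units below the vowel peak for voiced stops, about 10 to 15 units below for nasals) suggest a simple level comparison. The sketch below is hedged accordingly: the function names and the 17.5-unit boundary, taken as the midpoint between the two reported ranges, are assumptions, and the closure level is taken as the larger of filters 1 and 2 for both classes for simplicity.

```python
def closure_drop(closure, vowel):
    """Units by which the closure level falls below the vowel peak.

    closure, vowel: lists of 19 filter amplitudes (filter 1 first).
    The vowel peak is the maximum of filters 2, 3, and 4, as in the text;
    the closure level is the maximum of filters 1 and 2 (an assumption).
    """
    return max(vowel[1:4]) - max(closure[:2])


def closure_type(closure, vowel, boundary=17.5):
    """Label a voiced closure interval using a hypothetical boundary."""
    if closure_drop(closure, vowel) < boundary:
        return "nasal-like"
    return "voiced-stop-like"
```

A practical scheme would also need the weak energy near filter 4 noted above, since level alone depends on the speaker, the consonant, and the following vowel.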


TABLE VII. Durations of closure intervals for stop and nasal
consonants preceding stressed vowels. Averages
over three vowel environments /i a u/ and over
three talkers.

Consonant    Average Duration (msec)

p            130
t            120
k            110
b            130
d            120
g            120
č            110
ǰ            110
m            120
n            130
l            130

Durations of the closure intervals for stop and nasal consonants
in the environment /ə'CV/ have been measured from spectrograms.
These durations, averaged over three vowel environments, are
listed in Table VII. There are no significant effects of vowel
environment, and there are only slight differences in duration
between the three classes of consonants (voiceless stops, voiced
stops, and nasals). Individual utterances may have stop gap
durations that differ from the average values by as much as 25
percent. The average duration for the /l/ closure is also shown in
Table VII, since in many respects /l/ can be classified with stops
and nasals. These differences in duration between the various
classes of consonants are much more marked for the consonants in
poststressed position, as will be observed later. Likewise, the
durations of stop gaps in prestressed position may be considerably
shorter (by as much as 50 percent) than the values given in
Table VII when the consonants are generated in the context of a
longer speech sample.


5.2  Constricted  interval:  fricative  consonants 

The acoustic spectra within voiceless fricative consonants are
always characterized by high-frequency energy, although in the
case of the consonants /f/ and /θ/ this energy may be weak (Hughes
and Halle, 1956; Heinz and Stevens, 1961). Spectrograms of the
four voiceless fricative consonants preceding the vowel /a/ are
shown in Fig. 11. Typical spectra of these consonants are plotted
in Fig. 12. Spectra for the other two speakers have similar gross
characteristics. As is well known, the lowest major energy
concentration in the spectrum for /š/ is in the frequency range
2000 to 3000 Hz (filters 13 to 14 in the example shown in Fig. 12),
and there are further energy peaks at still higher frequencies.
For /s/, on the other hand, the increase in spectral energy does
not begin until filters 16-17 (3560-^00 Hz). The consonants /f/
and /θ/ have some high-frequency energy only in filter 19
(around 6000 Hz), but the overall intensity for these consonants
is quite low. The fricative /f/ has some weak energy in the
low-frequency range, but there is essentially no low-frequency energy
for the remaining voiceless fricative consonants; this fact
provides a simple and reliable way of separating voiceless fricatives
from all voiced sounds.

Fig. 11 Spectrograms illustrating properties of voiceless
fricative consonants. Speaker KS.

Fig. 12 Spectra during constricted interval for
voiceless fricative consonants in prestressed
position. Speaker KS.

At high frequencies, the spectra for voiced fricatives are very
similar to the spectra of their voiceless cognates. The voicing
is manifested by low-frequency energy; the amplitude of the first
filter output seems to be always the greatest, and there is
relatively small output for all low-frequency filters above the
second. Examples of spectrograms and spectra for voiced fricatives
are shown in Figs. 13 and 14, respectively. The amplitude of the
first filter output is consistently 15-30 units below the peak
amplitude of the following vowel, depending upon the speaker to
some extent. One of the three speakers examined in this study
(GW) tends to generate many continuant consonants with a less
constricted vocal tract, and consequently the high frequencies
(above, say, 500 Hz) for voiced fricatives are not as weak as for
the other speakers. An example of a /v/ spectrum for this speaker
is shown in Fig. 14.


Durations of the constricted intervals for voiceless and voiced
fricatives in prestressed position are somewhat longer than the
durations of the closure intervals for the corresponding stop
consonants. Examples of these durations obtained from several vowel
environments are given in Table VIII. The durations of voiced
fricatives are consistently about 50 msec less than those of
voiceless fricatives, but the effects on duration of place of
articulation are relatively small.

Fig. 13 Spectrograms illustrating properties of voiced
fricative consonants. Speaker KS.

TABLE VIII. Durations of constricted intervals for fricative
consonants preceding stressed vowels. Averages
over three vowel environments /i a u/ and over
three speakers.

Consonant    Average Duration (msec)

f            180
θ            180
s            180
š            200
v            130
ð            130
z            140
ž            150
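
As a small worked check, the Table VIII averages can be compared pair by pair to see how closely they match the stated difference of about 50 msec between voiceless fricatives and their voiced cognates. The ASCII labels used as dictionary keys ("th", "dh", "sh", "zh" for the dental and palato-alveolar fricatives) and the function names are conveniences assumed for this sketch.

```python
# Average durations in msec, transcribed from Table VIII.
VOICELESS = {"f": 180, "th": 180, "s": 180, "sh": 200}
VOICED = {"v": 130, "dh": 130, "z": 140, "zh": 150}

# Voiceless/voiced cognate pairs sharing a place of articulation.
COGNATES = [("f", "v"), ("th", "dh"), ("s", "z"), ("sh", "zh")]


def voicing_differences():
    """Duration difference (voiceless minus voiced) for each cognate pair."""
    return {f"{a}-{b}": VOICELESS[a] - VOICED[b] for a, b in COGNATES}


def mean_difference():
    """Mean of the pairwise differences, in msec."""
    diffs = list(voicing_differences().values())
    return sum(diffs) / len(diffs)
```

The pairwise differences are 50, 50, 40, and 50 msec, for a mean of 47.5 msec, which is consistent with the "about 50 msec" figure.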

5.3  Constricted  interval:  liquids  and  glides  (sonorant,  nonnasal  consonants)

The consonants /w j r l/ in initial prestressed position are all
characterized by an interval of 50-100 msec in which the
first-formant frequency (frequency of low energy peak) is low and in
which the amplitude at low frequencies is lower than in the
following vowel. Spectrograms of utterances containing these
sonorant consonants are shown in Fig. 15. Examples of spectra taken
in the middle of this constricted interval (between initial schwa
and stressed vowel in utterances of the type /ə'CV(C)/) are given
in Fig. 16. All of these spectra have a frequency peak either in
filter 1 or in filter 2, with sharply dropping energy above this
spectral peak. The rate of decrease in energy for filters 3
through 6 is greatest for the glide /j/, and least for the liquids
/r/ and /l/. The consonants /r/ and /l/ have two formants in the
frequency range up to 1100 Hz, and consequently the low-frequency
peak, which is a consequence of the first two formants, is broader
than for the /j/, which has only one formant at low frequencies.
For the glide /w/, the second formant is lower (it is at about
700 Hz) than for /r/ and /l/, and consequently the drop in energy
at the upper side of the low peak is more rapid. For two of the
three speakers studied, the initial consonant /r/ seems to exhibit
a small secondary peak at filter 5 or 6 (980 to 1160 Hz), probably
because the low third formant helps to boost the amplitude of the
second-formant peak.

The peak amplitude at low frequencies for all of these sounds is
5-15 units (4-12 dB) below the peak amplitude of the adjacent
vowel. This reduction in energy during the consonant is a
consequence of the constricted vocal-tract configuration associated
with the consonant; this constricted configuration gives rise to
a low-frequency first formant, and it can be shown that the
amplitude of the first-formant peak goes down as the frequency of the
first formant decreases (Fant, 1956). There may be also some
reduction in the output of the vocal cords in the consonantal
interval.


Both /r/ and /w/ have essentially no energy in the frequency range
above filter 10 (1880 Hz), whereas for /l/ and /j/ there are energy
peaks at high frequencies. (For one of the speakers, there is
more high-frequency energy in these consonants than for the other
two speakers.) These peaks are more pronounced for the /j/, and
the biggest peak seems to be around filter 16 (about 3000 Hz).
For /l/ the weak high-frequency peak is in the vicinity of
filter 13 (about 2400 Hz). The absolute level of high-frequency
energy probably does not provide an important cue for distinguishing
among these consonants. Of greater importance, perhaps, is
the transition between the consonant and the following vowel, and
in particular the contrast in high-frequency energy level between
the consonant and the vowel, as discussed later in this Section
of the report.

Fig. 16 Spectra sampled during middle of constricted
interval for liquids and glides in the
prestressed position. Speaker KS.


5.4  Release  and  transitions:  stop  and  nasal  consonants 

The release from a stop or nasal consonant into the following
stressed vowel is characterized by discontinuities in the amplitudes
in some frequency regions, as the spectrograms in Fig. 8 have
demonstrated, and by transitions of the formants into steady-state
positions characteristic of the following vowel. The spectrograms of
Fig. 8 show, for example, that the second formant in the syllable
/ba/ undergoes a rising transition following release of the
consonant, whereas for /da/ and /ga/ there is a falling
second-formant transition. At the output of the 19-channel analyzer,
the amplitude discontinuities are less obvious than on the
spectrograms, since the smoothing filters in the analyzer have
relatively long time constants (10-20 msec). The discontinuous
changes are expected to occur over an interval as short as 10-20
msec.

Although a criterion for identifying a consonant as being a stop
or a nasal has not been worked out in quantitative terms, it is
possible to state the nature of such a criterion. The requirement
(in the case of a consonant in prestressed position) is that there
should be an interval of the order of 50-100 msec or more in
duration in which there are no rapid changes in any of the spectrum
channels. Following this interval there should be a discontinuous
change in most of the spectrum channels in which measurable energy
exists. Thus, for example, the amplitude in channel 5 for the
utterance /a'ma/, shown in Fig. 17, would satisfy the requirement,
whereas a contour like that shown for the utterance /a'wa/ would not
satisfy the requirement. Even though the rate of change of amplitude
is comparable in the two cases, the glide /w/ does not have a
sufficiently long steady-state interval preceding this change. A
criterion such as this would, incidentally, place the liquid /l/ in
the same class as stops and nasals, a categorization that is
appropriate on other grounds.
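
The criterion stated above (a steady interval followed by a
discontinuous change) can be sketched in code for a single channel.
The fragment below is an illustration only, not a procedure worked
out in this study. The function name, the 10-msec sample spacing, and
the two amplitude thresholds (steady_tol and jump_min, in analyzer
units) are hypothetical values chosen for the example.

```python
def find_discontinuities(channel, sample_ms=10.0, min_steady_ms=50.0,
                         steady_tol=3.0, jump_min=10.0):
    """Return sample indices i at which channel[i] -> channel[i + 1] is a
    jump preceded by a steady interval of at least min_steady_ms.

    All numeric defaults are assumptions made for this sketch.
    """
    n_steady = int(round(min_steady_ms / sample_ms))
    hits = []
    for i in range(n_steady, len(channel) - 1):
        # Samples i - n_steady .. i span min_steady_ms of signal.
        window = channel[i - n_steady:i + 1]
        steady = max(window) - min(window) <= steady_tol
        jump = abs(channel[i + 1] - channel[i]) >= jump_min
        if steady and jump:
            hits.append(i)
    return hits


# A nasal-like contour: flat murmur, then an abrupt rise.
nasal_like = [10] * 8 + [30] * 5
# A glide-like contour: the same final rise, but no steady interval before it.
glide_like = [10, 12, 14, 16, 18, 20, 22, 24, 44, 44, 44]

print(find_discontinuities(nasal_like))
print(find_discontinuities(glide_like))
```

A full test along these lines would apply the check to every channel
with measurable energy and require the jump in most of them, as the
text specifies.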


Nasal consonants can, of course, usually be distinguished from
stop consonants by the nature of the spectrum during the closure
interval, as noted earlier, although the characteristics of the
release into the following vowel also provide important cues for
identifying nasals as opposed to voiced stops. Voiced and voiceless
stop consonants can often be differentiated on the basis of the very
low-frequency energy that may exist during the closure interval for
voiced stops but not for voiceless stops. This is not a reliable
indicator, however, since frequently a voice bar does not exist for
a voiced stop, particularly when it occurs in initial position.

Fig. 17 Graph of amplitude of channel 5 vs time for the utterances
/a'ma/ and /a'wa/. The jump in amplitude from sample 31 to 32 would
be called a "discontinuity" in the case of /a'ma/ because it is
preceded by a long interval in which the amplitude is essentially
constant.


A more reliable procedure for identifying aspirated (voiceless)
stops as opposed to voiced stops is to detect the presence of the
aspiration noise which always follows the release of a voiceless
stop in English. This aspiration interval can be most readily
differentiated from voicing by observing the outputs of the lowest
three or four filters (region of the first formant of the following
vowel). During aspiration, the amplitude in this frequency range is
always less than that for a vowel, by 20 or more units (15 dB or
more). This property of voiceless stops is illustrated in Fig. 18,
which compares a plot of the output of filter 2 as a function of
time for several voiced and voiceless stops. The discontinuities at
the consonantal release and again at the onset of voicing for the
voiceless stops are clearly observable. The duration of aspiration
following the release of voiceless stops is in the range 50-100 msec
(Lisker and Abramson, 1964). The duration of frication noise
following the release of the affricate consonants /č/ and /ǰ/ is in
this range also. The length of aspiration for the voiceless stops
tends to be smallest for /p/ and greatest for /k/, although these
differences are not observable in the few examples shown in Fig. 18.
In the case of the voiced stops, the increase in level immediately
following the release is quite rapid since voicing commences
immediately upon release or shortly thereafter. This rate of
increase in level at low frequencies appears to be greater for the
labial stop /b/ than for /d/ or /g/ (as discussed later).
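
The aspiration test described above can be illustrated with a short
sketch. It assumes a low-frequency filter output (such as filter 2)
expressed in dB, a known release sample, and a known level for the
following vowel. The function names, the 10-msec sample spacing, and
the use of 50 msec as a decision threshold are assumptions made for
the example; only the 15-dB drop and the 50-100 msec range come from
the text.

```python
def aspiration_ms(low_db, release, vowel_db, sample_ms=10.0, drop_db=15.0):
    """Duration after release during which the low-frequency level stays
    at least drop_db below the level of the following vowel."""
    count = 0
    for level in low_db[release:]:
        if level > vowel_db - drop_db:
            break  # voicing has begun
        count += 1
    return count * sample_ms


def is_aspirated(low_db, release, vowel_db, min_ms=50.0):
    """Hypothetical decision rule: aspirated if the weak interval lasts
    at least min_ms."""
    return aspiration_ms(low_db, release, vowel_db) >= min_ms


# Release at sample 3 in both synthetic contours; vowel level 40 dB.
voiceless_like = [0, 0, 0, 5, 5, 5, 5, 5, 5, 40, 40]
voiced_like = [0, 0, 0, 30, 40, 40]

print(aspiration_ms(voiceless_like, 3, 40.0), is_aspirated(voiceless_like, 3, 40.0))
print(aspiration_ms(voiced_like, 3, 40.0), is_aspirated(voiced_like, 3, 40.0))
```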


Place of articulation for stop and nasal consonants is determined
primarily by the detailed acoustic events at the consonantal release
and during the 50- to 100-msec interval following the release.
Differences in these transitions between consonant and vowel are
attributable, of course, to the fact that the articulatory mechanism
must perform movements between the different consonant
configurations and the following vowel configurations, and the
durations of these movements may be 50 msec or more. If the
consonant starting point is different (as it is with the labials
/b p m/ as opposed to the dentals /d t n/ as opposed to the velars
/g k/) then the transitions will be different.

A measurement procedure that will reliably separate labials,
dentals, and velars has not yet been developed. It is known that the
transitions of the formants between consonant release and vowel,
particularly the second-formant transition, provide important cues
for consonant identification (Liberman, Delattre, Cooper, and
Gerstman, 1954). The spectral and temporal characteristics of the
burst of noise that is often generated at the constriction at the
instant of release also are important indicators of place of
articulation for the consonant. The spectrograms of Fig. 8
illustrate some of these differences in formant transitions and in
the burst, but the differences are subtle and may be difficult to
detect by machine. Furthermore, the directions of the formant
transitions for a given consonant may depend to some extent on the
following vowel.


An illustration of the kind of procedure that will be necessary to
identify these consonants is presented in Fig. 19. This figure shows
the outputs of three of the filters (filter 5 at 980 Hz, 8 at
1520 Hz, and 16 at 3260 Hz) as a function of time in the 100-msec
interval immediately following the release, for the syllables /ba/,
/da/, /ga/, /na/, and /ma/. For /ba/, there is a tendency for the
amplitude increases in filters 8 and 16 to lag behind that in
filter 5. In the case of /da/, on the other hand, filter 16 shows
the earliest initial onset, with the amplitudes in filters 8 and 5
rising at successively later times. Filter 8 shows the most rapid
initial rise for the syllable /ga/, and this rate of increase is
more abrupt than for /ba/ and /da/. Thus, there is a tendency for
the initial onset of energy to be at low frequencies for /b/, at
high frequencies for /d/, and in the midfrequency range for /g/
(Stevens, 1967). The initial /n/ in /na/ shows some of the
characteristics of /d/: there is negligible energy in filter 16 in
this case, but the onset of energy in filter 8 precedes that in
filter 5. This is not the case for the syllable /ma/. For the /g/ in
prestressed position, there is an initial burst of noise energy of
about 30-msec duration; evidence of this noise burst can be seen in
channels 5 and 8 of Fig. 19. The initial burst is briefer for /d/
(about 10 msec) and of higher frequency (as seen in channel 16 of
Fig. 19), but there is essentially no noise burst at the onset of
/b/. These properties of the burst can also be observed
qualitatively in the spectrograms of Fig. 8. The picture presented
in Fig. 19 will change somewhat depending upon the following vowel,
and would probably show the effects more clearly if the averaging
times of the smoothing filters in the analyzer were shorter.

Fig. 19 Outputs of filters 5, 8, and 16 during the time interval
immediately preceding and following the release for the syllables
indicated. Speaker KS.
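
The tendency described above, in which the earliest onset of energy
falls in a different filter for each place of articulation, can be
expressed as a small sketch. It encodes only that tendency and is not
a reliable classifier; as noted earlier, no such procedure has yet
been developed. The function names and the onset threshold of 10
analyzer units are hypothetical.

```python
def onset_index(track, threshold):
    """First sample at which a filter output reaches the threshold."""
    for i, level in enumerate(track):
        if level >= threshold:
            return i
    return None


def place_tendency(f5, f8, f16, threshold=10.0):
    """Name the place suggested by whichever filter shows energy first:
    filter 5 (low) -> labial, filter 8 (mid) -> velar,
    filter 16 (high) -> dental. Returns None on a tie or no onset."""
    onsets = {
        "labial": onset_index(f5, threshold),
        "velar": onset_index(f8, threshold),
        "dental": onset_index(f16, threshold),
    }
    reached = {k: v for k, v in onsets.items() if v is not None}
    if not reached:
        return None
    earliest = min(reached.values())
    leaders = [k for k, v in reached.items() if v == earliest]
    return leaders[0] if len(leaders) == 1 else None


# Synthetic tracks sampled from the release onward.
print(place_tendency([0, 12, 20, 30], [0, 0, 15, 25], [0, 0, 0, 12]))
print(place_tendency([0, 0, 0, 15], [0, 0, 12, 20], [0, 11, 18, 20]))
```

A rule this simple would mislabel /na/, for which filter 16 carries
negligible energy and filter 8 leads filter 5, so rise rate and burst
duration would also have to be taken into account.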


For initial voiceless stop consonants, the time-varying spectral
patterns immediately following the release are similar in some
respects to those of the voiced stops. Figure 20 shows spectra
sampled at 60-msec intervals following release of each of the
voiceless stops with the following vowel /a/. In the case of the
labial stop /p/, the spectrum in the aspiration interval immediately
following the release indicates weak energy with no pronounced
high-intensity spectral peaks. After onset of voicing, the vowel
spectrum remains reasonably stable. For /t/ and /k/, on the other
hand, there are pronounced spectral peaks in the noise interval, at
high frequencies for /t/ and in the midfrequency range for /k/. The
spectrum changes that occur after voicing onset indicate that there
is a rising transition of the first formant and a falling transition
of the second formant during the /a/ following these two consonants.
The spectrum characteristics during the aspiration interval for /k/
are very much dependent upon the following vowel.

Fig. 20 Spectra (obtained from the 19-channel filter bank) sampled
at 60-msec intervals following release of the stop consonants in the
syllables /pa/, /ta/, and /ka/. The first spectrum is sampled about
20 msec after consonant release, the second spectrum occurs about
20 msec after voicing onset, and the third is in the middle of the
vowel.

Speaker KS.


5.5  Release and transitions: liquids and glides

As noted above, the consonants /w j r/ in the prestressed position
do not exhibit a steady-state interval during the constriction, but
rather are characterized by continuous change. This property is
illustrated in the spectrograms of the consonants shown in Fig. 15.
Examples of several filter outputs, plotted as a function of time,
for /w j r l/ in the environment /a'Ca/ are displayed in Fig. 21.
For /wa/ the amplitude rise for filter 8 occurs later than for
filter 5 (as it does in the syllable /ba/), whereas for /ja/ the
situation is reversed. As noted earlier, there is a reasonably long
steady-state interval for /l/ before the rapid transition into the
following vowel. No such steady-state region exists for /r/, but
with /r/ there is a greatly delayed onset of the amplitude of
filter 16 relative to lower-frequency filters (as a consequence of
the rising transition of the third formant, which is always
characteristic of initial /r/).

When the consonants /w j r l/ occur in other vowel environments,
data similar to those shown in Fig. 21 are obtained, but the
relations between timing of filter outputs may not always be as
clear as those in the figure.


5.6  Summary  of  characteristics  of  consonants  in 
prestressed  position 

The acoustic attributes of consonants in prestressed position can
be conveniently summarized in terms of the features listed in
Table VI.

Fig. 21 Outputs of several filters (as indicated) for liquid and
glide consonants in the environment /a'Ca/. Speaker KS (in all
cases).

Stop consonants are characterized by a closure interval within
which the properties of the signal at low frequencies do not change
appreciably and in which there is negligible high-frequency energy
(above, say, 2000 Hz). In the case of nonnasal stops, there is
essentially no sound energy in this interval above about 500 Hz.
This closure time is followed by an almost discontinuous change
coincident with release of the articulatory closure into the
following vowel. This abrupt change occurs in some frequency ranges,
but not necessarily at all frequencies, particularly in the case of
the sonorant stops (i.e., nasals and /l/).


For  sonorant  consonants,  voicing  continues  with  appreciable  energy 
through  the  closure  interval,  and  the  spectral  maximum  during  this 
interval  is  always  in  the  first  or  second  filter  of  the  19-channel 
analyzer  used  in  this  study,  i.e.,  below  440  Hz.  The  spectral 
energy  in  this  frequency  range  is  several  decibels  below  that  of 
the following vowel. A sonorant has no high-frequency noise energy.

The feature voicing implies that periodicities and low-frequency
energy  continue  through  the  closure  interval.  In  the  case  of 
voiced  stop  consonants,  vocal-cord  vibration  may  be  weak  or  absent 
during  the  closure  interval,  but  voicing  commences  almost  immedi¬ 
ately  upon  release  of  the  stop.  Thus,  for  a  segment  to  be  voiced, 
an  interval  of  high-frequency  noise  (30  msec  or  more  in  duration) 
must  not  occur  in  the  absence  of  low-frequency  periodicities. 


As Table VI indicates, a fricative consonant has the features
-stop, -sonorant. Thus, such a consonant has high-frequency noise
energy during the closure interval, and may or may not be voiced
during this time.

The acoustic correlate of the feature aspiration (which in English
applies only to stop consonants) is an interval of 50-100 msec of
noise that occurs after release of the stop and before voicing
occurs. This aspiration noise has negligible energy in the
low-frequency (first-formant) region.

A  nasal  consonant  is  a  sonorant  and  a  stop  consonant,  and  hence 
has  the  attributes  of  both  of  these  features.  A  distinguishing 
attribute  of  a  nasal  is  a  relative  lack  of  energy  in  the  frequency 
region  around  800  Hz,  whereas  there  is  strong  spectral  energy  at 
lower  frequencies  and  there  may  be  spectral  peaks  above  800  Hz. 
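
The feature descriptions given so far in this summary can be
collected into a short sketch that maps a few yes/no measurements
onto features. This is an illustration under simplifying assumptions,
not the analysis used in this study. The function and argument names
are hypothetical, each measurement is assumed to have been made
already, and the spectral-maximum condition for sonorants is omitted.

```python
def features(steady_low_freq, abrupt_release, hf_noise, voicing_energy,
             dip_near_800_hz=False, aspiration_ms=0.0,
             min_aspiration_ms=50.0):
    """Assign features from measurements of the constricted interval.

    steady_low_freq  low-frequency signal roughly constant during closure
    abrupt_release   near-discontinuous change at the release
    hf_noise         high-frequency noise energy during the interval
    voicing_energy   appreciable voicing energy through the interval
    dip_near_800_hz  relative lack of energy around 800 Hz
    aspiration_ms    noise duration between release and voicing onset
    """
    stop = steady_low_freq and abrupt_release and not hf_noise
    sonorant = voicing_energy and not hf_noise
    return {
        "stop": stop,
        "sonorant": sonorant,
        "fricative": (not stop) and (not sonorant) and hf_noise,
        "nasal": stop and sonorant and dip_near_800_hz,
        "aspirated": stop and (not sonorant)
                     and aspiration_ms >= min_aspiration_ms,
    }


# Three synthetic cases: a nasal, a voiceless fricative, an aspirated stop.
print(features(True, True, False, True, dip_near_800_hz=True))
print(features(False, False, True, False))
print(features(True, True, False, False, aspiration_ms=70.0))
```

Under these rules a segment that is both stop and sonorant but lacks
the dip near 800 Hz corresponds to /l/, in line with the
classification suggested in Section 5.4.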

The  features  of  place  of  articulation  for  consonants  (the  features 
anterior  and  coronal  in  Table  VI)  are  difficult  to  describe  simply 
in  terms  of  common  properties  that  are  valid  for  fricative,  stop, 
and  sonorant  consonants.  For  fricative  consonants,  the  noise  spec¬ 
trum  during  the  constricted  interval  provides  important  cues  for 
place  of  articulation.  For  other  classes  of  consonants,  place  of 
articulation  is  determined  by  the  way  in  which  the  characteristics 
of  the  signal  change  with  time  in  the  few  tens  of  milliseconds 
following consonant release. Postdental consonants tend to have
strong initial high-frequency energy in this interval, whereas for
velars the energy onset is in the midfrequency range. Further
study is needed to provide a more precise specification of the
acoustic correlates of place of articulation, particularly for
stop consonants.


6.  CONSONANT CLUSTERS IN PRESTRESSED POSITION

The  consonant  clusters  that  can  occur  in  initial  position  in  Eng¬ 
lish  can  be  divided  into  two  classes:  (1)  /s/  followed  by  a  stop 
(or nasal) consonant (/sp/, /sn/, etc.); and (2) a liquid or glide
preceded by a stop or fricative consonant (/pr/, /fl/, /gr/, /tw/,
etc.). A third class consists of triplets that are members of
both these classes, e.g., /str/, /spl/. Examples of spectrograms
of these clusters in the environment /a'C1C2(C3)a/ are shown in
Fig. 22.

In  the  case  of  clusters  with  initial  /s/,  the  fricative  has  more 
or  less  the  same  spectral  characteristics  as  an  initial  fricative 
without  an  adjacent  consonant.  Its  duration  tends  to  be  somewhat 
shorter, however, particularly when it precedes a stop consonant.
Measured  durations  of  the  noise  in  /s/  in  such  clusters  are  given 
in  Table  IX.  The  noise  interval  appears  to  be  the  shortest  for 
the  three-segment  clusters  and  longest  preceding  /l/  and  /w/.  A 
stop  consonant  following  /s/  has  very  little  aspiration  noise  fol¬ 
lowing  the  release.  Values  of  the  duration  of  the  noise  interval 
between  release  of  the  stop  and  the  onset  of  voicing  are  about 
30  msec  for  velar  consonants  and  less  for  other  stops.  The  dura¬ 
tions  of  the  stop  gap  and  of  the  nasal  murmur  (for  the  clusters  /sm/ 
and  /sn/)  are  also  given  in  Table  IX.  These  durations  are  consid¬ 
erably  shorter  than  the  corresponding  durations  when  the  stops  and 
nasals  occur  as  single  prestressed  consonants.  The  discontinuous 
changes  at  the  release  of  the  stop  and  nasal  consonants  are  similar 
to  those  discussed  previously. 

Table IX also shows the approximate durations of the constricted
interval in sonorant consonants when they are preceded by /s/.
Again, this duration (average of 50 msec) is shorter than the
corresponding duration when the sonorant is the only consonant
preceding a stressed vowel.


TABLE IX. Durations of noise in fricative /s/ and of stop gap or
sonorant murmur in following consonant for various consonant
clusters in the environment /a'sC(C)a/. Average values for three
speakers.

Cluster        Duration of Noise    Duration of Stop Gap
                    (msec)          or Sonorant Murmur (msec)

sp, st, sk           120                  80
sm, sn               150                  50
sl, sw               160                 ~50
skr, spl             110                  90 (stop gap)

When a voiceless stop consonant is followed by a glide or liquid,
the duration of the closure interval is slightly less than that for
a stop immediately preceding a vowel but the duration of the
aspiration is considerably greater. The durations of aspiration
measured from spectrograms of the utterances containing such
clusters range from 80 to 110 msec, compared with 50 to 100 msec
when the consonants appear singly. The duration of the voiced
segment of the glide or liquid preceding the vowel is quite brief.
After the onset of voicing, a rapid transition toward the following
vowel begins almost immediately. These characteristics are
observable in the spectrograms of Fig. 22.


7.  UNSTRESSED VOWELS AND VOWELS WITH SECONDARY STRESS

A vowel that forms the nucleus of a syllable in English can be
assigned a feature which indicates the stress of the syllable. It
is possible for two utterances to be identical in all respects
except the stress on the syllabic nuclei. For example, the noun and
verb forms of the word reject differ only in the stress assigned to
the two syllables.

Listeners  appear  to  be  able  to  assign  at  least  three  degrees  of 
prominence  to  vowels  that  form  the  nuclei  of  syllables  in  English. 
It  is  generally  assumed,  then,  that  a  syllable  can  be  character¬ 
ized  by  at  least  three  degrees  of  stress.  The  degree  of  stress  on 
a  vowel  determines,  to  some  extent,  the  acoustic  characteristics 
of  the  vowel,  particularly  its  duration,  its  fundamental  frequency, 
and its intensity. As has been noted earlier, however, stress on
a vowel also has a marked effect on the properties of the
consonants that precede and follow the vowel.

The  acoustic  characteristics  of  vowels  and  consonants  that  have 
been  discussed  up  to  this  point  were  obtained  for  syllables  with 
primary  stress,  i.e.,  with  the  highest  of  the  three  degrees  of 
stress.  We  use  the  term  secondary  stress  to  designate  the  degree 
of  stress  on  a  vowel  whose  quality  is  similar  to  that  of  a  vowel 
with  primary  stress  but  whose  prominence,  as  judged  by  listeners, 
is  less  than  that  of  some  other  vowel  in  the  same  utterance  that 
is judged to have primary stress. Thus, in each of the words
seaweed, essay, and cocoa, the second vowel is considered to have
secondary stress.* A still lower degree of stress is assigned to a
vowel for which the quality is changed to that of a schwa vowel /ə/
by virtue of this stress assignment. The vowel in this case is said
to be reduced. Thus the second vowel in the word famous is generally
regarded to be a reduced vowel. For a reduced vowel, it is not
necessary to specify place of articulation, since two words cannot
differ only in the place of articulation of a reduced vowel.

*This designation of stress is not entirely in accord with that of
others, but is sufficient for our purposes.

Other  examples  of  reduced  vowels  are  the  vowels  in  the  second 
syllables  of  the  words  poker,  bushel,  and  wagon.  The  unstressed 
vowels  in  these  words  are  often  designated  as  syllabic  /r/,  /l/, 
and /n/, respectively, although it may be more appropriate to
represent such vowels as a schwa vowel /ə/ followed by a final
consonant /r/, /l/, or /n/.


7.1  Duration  and  fundamental  frequency 

In general, unstressed vowels tend to be shorter and of lower
amplitude than stressed vowels, but the words studied here do not
provide enough examples of the vowels to permit quantitative data
on these durations and amplitudes to be tabulated. Durations of
reduced vowels in syllable-final position in bisyllabic words are
in the range 50-150 msec. Vowels with secondary stress tend to be
slightly longer. When a reduced vowel occurs in initial position in
a bisyllabic utterance (as in the utterances /a'CVC/ or as in a
word like about or alike), the duration can become as short as
20 msec, and the vowel may consist of just two or three glottal
vibrations. In fact, it is not uncommon for such a vowel to be
omitted altogether in rapid speech.

Fundamental frequency for vowels in unstressed syllables tends to
be lower than for stressed vowels. Although detailed measurements
of fundamental frequency were not made in this study, informal
observations of spectrograms support this result, which has often
been reported in other studies. (See, for example, Lieberman,
1967.) Neither fundamental frequency nor duration provides a
reliable procedure, however, for identifying the degree of stress
that is assigned to a vowel.


7.2  Spectra  of  vowels  with  secondary  stress 

Examples  of  spectra  for  nonreduced  vowels  generated  by  one  of  the 
speakers  are  shown  in  Fig.  23.  Comparison  of  these  spectra  with 
those  shown  earlier  for  stressed  vowels  (Fig.  5)  indicates  that 
the  unstressed  vowels  have  similar  frequency  characteristics.  The 
peak  amplitude  for  the  unstressed  vowel  tends  to  be  lower  than 
that  for  the  stressed  vowel  in  the  same  word,  but  the  amplitude 
differences  between  stressed  and  unstressed  vowels  are  by  no  means 
consistent for all three speakers. While the spectrum shape is
similar for the same vowel with primary and with secondary stress,
the spectrum of the latter sometimes has weaker low-frequency
energy relative to high-frequency energy. This effect may, however,
be due to the fact that the vowel with secondary stress is always
in utterance-final position for the words examined here.


7.3  Spectra of reduced vowels

Spectra sampled within several reduced vowels of various types are
plotted in Fig. 24. All of these vowel spectra are characterized
by a low-frequency peak at filter 2 or at filter 3. The schwa
vowel in these examples has a second-formant peak at filter 8
(1520 Hz), as is characteristic of such neutral vowels, but the
position of the second formant for such a vowel may vary
appreciably with context. The final reduced vowel (in the word
ragged) appears to be richer in high-frequency energy than the
initial reduced vowel (in the utterance /a'dad/). Syllabic /l r n/
have much less high-frequency energy than the schwa vowel. All of
these syllabic sounds appear to have more high-frequency energy
than their consonantal counterparts that occur in prestressed
position. Presumably the syllabic sounds tend to be generated with
a more open vocal-tract configuration, and are therefore
characterized by a higher first-formant frequency and possibly by a
vocal-cord source spectrum that is richer in high frequencies. The
schwa vowel in prestressed position (as in /a'dad/), on the other
hand, is probably generated with a more constricted vocal-tract
configuration and with a vocal-cord pulse shape that is broader.
Both of these effects would tend to reduce the amount of
high-frequency energy relative to the energy at low frequencies.
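
Statements such as "a low-frequency peak at filter 2 or at filter 3"
presuppose some way of locating peaks in a sampled 19-channel
spectrum. A minimal peak-picking sketch is given below. The function
names, the min_rise parameter, and the synthetic spectrum are
assumptions made for illustration and are not measured data.

```python
def spectral_peaks(spectrum, min_rise=2.0):
    """Return 1-based filter numbers whose level exceeds each adjacent
    filter by at least min_rise (edge filters have one neighbour)."""
    peaks = []
    last = len(spectrum) - 1
    for i, level in enumerate(spectrum):
        neighbours = []
        if i > 0:
            neighbours.append(spectrum[i - 1])
        if i < last:
            neighbours.append(spectrum[i + 1])
        if neighbours and all(level - nb >= min_rise for nb in neighbours):
            peaks.append(i + 1)
    return peaks


def has_low_frequency_peak(peaks):
    """True if a peak falls at filter 2 or filter 3."""
    return any(p in (2, 3) for p in peaks)


# Synthetic schwa-like spectrum: peaks near filter 2 and filter 8.
schwa_like = [20, 40, 35, 25, 20, 22, 28, 36, 30, 24,
              20, 18, 16, 15, 14, 12, 10, 8, 6]
peaks = spectral_peaks(schwa_like)
print(peaks, has_low_frequency_peak(peaks))
```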
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8.  CONSONANTS  IN  POSTSTRESSED  POSITIONS 

The acoustic characteristics of consonants in poststressed
positions can be altered drastically relative to those in prestressed
position.  A  complete  discussion  of  all  the  ways  in  which  modifi¬ 
cations  of  consonant  properties  can  occur  cannot  be  given  in  this 
report,  but  a  few  examples  can  be  cited. 

When  nasal  or  stop  consonants  occur  after  stressed  vowels  and  pre¬ 
ceding  unstressed  vowels,  the  duration  of  the  closure  interval  may 
be greatly reduced. Values in the range 15-110 msec are observed
on spectrograms, as opposed to 110-130 msec when the consonants
are in prestressed position (in the environment /a'CV/). Examples
of these brief closure intervals are shown in the spectrograms in
Fig. 25. These durations are particularly short (15-50 msec) for
dental consonants /d t n/ followed by reduced vowels. The duration
of aspiration at the release of a stop consonant into an unstressed
vowel is also greatly reduced, and in some cases a voiceless stop
consonant may have essentially no aspiration in this environment.


Likewise  the  durations  of  fricative  consonants  are  reduced  in  a 
poststressed  environment  preceding  an  unstressed  vowel,  although 
the  reduction  is  not  as  marked  as  it  is  with  some  stop  and  nasal 
consonants. Durations of voiceless fricatives are in the range
110-150 msec, as opposed to about 180-200 msec in a prestressed
environment; voiced fricative durations are 70-120 msec, in
contrast to 130 msec or more in a prestressed environment.
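
The fricative duration ranges just quoted can be arranged as a small
lookup, which makes it easy to see which stress positions a measured
duration is consistent with. This is a sketch only. The names are
hypothetical, and the ranges are the approximate figures from the
text, not decision thresholds validated in this study.

```python
# Approximate duration ranges in msec, taken from the paragraph above.
FRICATIVE_DURATION_MS = {
    ("voiceless", "prestressed"): (180.0, 200.0),
    ("voiceless", "poststressed"): (110.0, 150.0),
    ("voiced", "prestressed"): (130.0, float("inf")),  # "130 msec or more"
    ("voiced", "poststressed"): (70.0, 120.0),
}


def consistent_positions(duration_ms, voicing):
    """List the stress positions whose quoted range contains duration_ms."""
    matches = []
    for position in ("prestressed", "poststressed"):
        low, high = FRICATIVE_DURATION_MS[(voicing, position)]
        if low <= duration_ms <= high:
            matches.append(position)
    return matches


print(consistent_positions(190.0, "voiceless"))
print(consistent_positions(100.0, "voiced"))
print(consistent_positions(160.0, "voiceless"))
```

A duration that falls between the quoted ranges yields an empty list,
which is a reminder that duration alone does not settle the stress
environment.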

The spectra of consonants in poststressed position preceding
unstressed vowels are considerably different from spectra of the
corresponding prestressed consonants. Voiced consonants exhibit
much more high-frequency energy during the closure interval, and
also have a greater overall amplitude. Figure 26, for example,
shows spectra sampled during the closure interval for several
voiced fricatives, nasals, liquids, and glides. Also plotted on
these graphs are examples of spectra for the same consonants in
prestressed position (shown previously in Figs. 10, 14, and 16).
The differences are presumably due in part to the fact that
fricatives, glides, and liquids are not as tightly constricted in
poststressed position as in prestressed position, with the result
that the low-frequency peak is not as low in frequency. It may also
happen that the spectrum of vocal-cord vibration is richer in
high-frequency energy for the poststressed consonants. Voiceless
stop consonants in poststressed position, on the other hand, are
characterized by spectra that are quite similar to those in
prestressed position.


For some of the bisyllabic utterances with stop consonants in
poststressed position, there is evidence from the spectrograms that
the stop closure was not complete. This is particularly true of
velar stop consonants (/k g/). In other words, the feature stop is often
not  clearly  registered  in  the  acoustic  signal  for  this  phonetic  en¬ 
vironment,  except  that  the  closure  interval  for  these  stops  is  gen¬ 
erally  shorter  than  that  for  fricatives. 


The acoustic properties of the release of various consonants into
unstressed vowels have not been examined in detail in this study.
It is evident, however, that the kinds of data that have been
shown earlier for prestressed consonants are substantially
modified and blurred when the consonants precede unstressed vowels.

Fig. 26 Spectra of some voiced consonants during the closure
interval (solid lines). The consonants are in poststressed position
preceding reduced vowels. Shown for comparison are spectra of the
same consonants in the prestressed environment /a'Ca/ (dashed
lines). Frequency axis in kHz. Speaker KS.

The speech material used in this study provides many examples of
consonants in word-final position. The durations of these
consonants, which occur at the ends of breath groups in all of the
utterances, are subject to considerable variability. Their spectral
characteristics show many features in common with the spectra of
consonants in poststressed position, discussed above.


9.  CONCLUDING REMARKS

The  data  that  have  been  presented  in  this  study  provide  an  indi¬ 
cation  of  the  acoustic  attributes  of  some  of  the  features  of  vowels 
and  consonants  in  various  phonetic  environments.  More  detail  has 
been  given  for  the  characteristics  of  segments  as  they  occur  in 
stressed  syllables,  since  it  is  assumed  that  this  phonetic  environ¬ 
ment  provides  the  clearest  indication  of  the  ideal  acoustic  attri¬ 
butes  of  the  features.  Certain  of  the  features  are  sufficiently 
well  understood  that  their  characteristics  can  be  described  in  de¬ 
tail;  others  have  not  yet  been  adequately  analyzed  and  documented. 
Some examples of the way in which the acoustic properties are
modified  in  other  kinds  of  phonetic  environments  have  also  been 
presented,  but  it  has  been  possible  to  consider  only  a  rather  lim¬ 
ited  range  of  situations. 

What  is  needed  in  the  future  is  a  general  set  of  rules  that  de¬ 
scribe  how  the  acoustic  characteristics  of  phonetic  segments  change 
in  various  environments,  particularly  in  utterances  consisting  of 
several  syllables.  From  the  fragmentary  data  presented  here  and 
from  other  data,  it  seems  evident  that  in  utterances  of  several 
syllables,  in  which  vowels  are  assigned  various  levels  of  stress, 
the  stress  pattern  exerts  a  controlling  influence  on  the  timing 
and  durations  of  events  within  the  utterances  and  on  the  degree  of 
precision  with  which  certain  segments  are  actualized.  A  detailed 
specification  of  these  influences  cannot  yet  be  made,  however, 
since the timing and rhythm associated with various stress patterns
are not understood.


It must be emphasized again (as it was in Section 1) that the
acoustic properties of a phonetic segment in an utterance are
influenced not only by the phonetic environment in which the
segment occurs, but also by semantic and syntactic factors.
Resolution of inadequate acoustic information in the signal through
recourse to semantic, syntactic, or even lexical considerations is
a task which a speech recognizer will probably not be able to
accomplish in the near future, at least in any general way.